SpookyHash is a public domain noncryptographic hash function producing well-distributed 128-bit hash values for byte arrays of any length. It can produce 64-bit and 32-bit hash values too, at the same speed: just use the bottom n bits. The C++ reference implementation is specific to 64-bit x86 platforms; in particular, it assumes the processor is little-endian. Long keys hash at 3 bytes per cycle, short keys take about 1 byte per cycle, and there is a 30-cycle startup cost. Keys can be supplied in fragments. The function allows a 128-bit seed. It's named SpookyHash because it was released on Halloween.
C++ code for SpookyHash V1 (groundhog's day 2012):
C++ code for SpookyHash V2 (labor day 2012):
Both V1 and V2 pass all the tests. V2 corrects two oversights in V1.
Why use SpookyHash?
For long keys, the inner loop of SpookyHash is Spooky::Mix(). It consumes an 8-byte input, then does an xor, another xor, a rotation, and an addition. Beyond 3 or 4 8-byte variables the internal state no longer fits entirely in registers, but parallelism keeps increasing, and so does avalanche per pass. There was a sweet spot around 12 variables.
My previous hashes would have added 12 8-byte inputs to the 12 variables, then done a few passes through all the variables to mix them well, then added the next batch of inputs. That is tested by avalanche.html. Unlike my previous hashes, Spooky::Mix() is trickle-feed. It adds an input to one variable, does a little mixing, then adds data to the next variable. There is no bulk mixing stage in between groups of 12 variables; the processing treats all variables symmetrically. Trickle-feeding should allow chips to keep a static workload, doing a steady proportion of memory fetches, additions, XORs, and rotates simultaneously.
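The trickle-feed pattern can be sketched in miniature. The step order below follows the description above (consume an input, two xors, a rotate, an add) and the index pattern is modeled on the reference Mix(), but the single rotation constant is illustrative; the reference implementation uses a different constant at each of the 12 steps.

```cpp
#include <cstdint>
#include <cstddef>

// Left-rotate, as in the reference code's Rot64().
static inline uint64_t Rot64(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

// 12 8-byte variables of internal state, trickle-fed one input word at a time.
struct MixState {
    uint64_t s[12] = {};

    // One pass over a 96-byte block: each step adds one input to one
    // variable and does a little mixing, with no bulk mixing stage between
    // blocks -- all 12 variables are treated symmetrically.
    void MixBlock(const uint64_t data[12]) {
        for (int i = 0; i < 12; i++) {
            s[i] += data[i];                      // consume an 8-byte input
            s[(i + 2) % 12] ^= s[(i + 10) % 12];  // an xor
            s[(i + 11) % 12] ^= s[i];             // another xor
            s[i] = Rot64(s[i], 11);               // a rotation (constant is illustrative)
            s[(i + 11) % 12] += s[(i + 1) % 12];  // an addition
        }
    }
};
```

Because every step touches only a few variables, the chip can overlap the memory fetch for the next input with the arithmetic for the current one, which is the steady workload the paragraph above describes.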
Testing a trickle-feed mix was a new problem. As always, the mix has to work equally well run forward or backward. As always, every input bit has to produce 128 bits of apparent entropy before those bits can be canceled by other input bits. (128 bits, because the hash produces a 128-bit result.) My handwave is that after 12 8-byte inputs, there has been as much input as there is internal state, so anything could be overwritten. So if a delta appears to produce 128 bits of entropy within 12 8-byte inputs of where the delta starts, that's good enough. I wrote screen.cpp to check for that.
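The kind of measurement screen.cpp makes can be illustrated in miniature: flip one input bit and count how many output bits change. This sketch uses MurmurHash3's fmix64 finalizer as a stand-in mixer, not SpookyHash, and measures a single 8-byte input rather than a delta across 12 of them:

```cpp
#include <cstdint>
#include <bitset>

// Stand-in mixer (MurmurHash3's 64-bit finalizer), used here only to
// illustrate the measurement; it is not SpookyHash.
static uint64_t ToyMix(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// Bits of apparent entropy produced by flipping one input bit: hash the key
// with and without the flipped bit and count the differing output bits.
static int AvalancheBits(uint64_t key, int bit) {
    uint64_t delta = ToyMix(key) ^ ToyMix(key ^ (1ULL << bit));
    return static_cast<int>(std::bitset<64>(delta).count());
}
```

A good mixer pushes each of these counts toward about half the output width; screen.cpp's actual criterion is the stronger one above, 128 bits of apparent entropy within 12 8-byte inputs of the delta.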
I tried SSE2 instructions and 64-bit multiplications, but it turned out that plain 64-bit rotate, addition, and XOR got results faster. I thought 4-variable or 6-variable schemes were going to win, but 12 variables won in the end. My boss suggests I look into GPUs. I haven't tried that yet, but given that memory bandwidth is already maxed out, I doubt they would help.
While Spooky::Mix() handles the middle of long keys well, it would need 4 repetitions for the final mix, and it has a huge startup cost, which would make short keys expensive. So I wrote ShortHash to produce a 128-bit hash of short keys with little startup cost, and Spooky::End() to reduce the final mixing cost (this shows up most for keys just over 192 bytes long). Those components aren't trickle-feed; they work the old way. ShortHash is used automatically for short keys.
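The automatic switch can be pictured as a simple length check. Everything below is a hypothetical sketch: the hash bodies are trivial FNV-style placeholders rather than SpookyHash, and 192 bytes is used as the illustrative cutoff because that is the key length the text identifies as the boundary where the long-key machinery starts to pay off.

```cpp
#include <cstdint>
#include <cstddef>

// Placeholder hash body (FNV-style), NOT SpookyHash; it only makes the
// dispatch below runnable.
static uint64_t PlaceholderHash(const void* key, size_t len, uint64_t h) {
    const unsigned char* p = static_cast<const unsigned char*>(key);
    for (size_t i = 0; i < len; i++) h = (h ^ p[i]) * 0x100000001b3ULL;
    return h;
}

static uint64_t ShortHashSketch(const void* key, size_t len) {
    return PlaceholderHash(key, len, 0xcbf29ce484222325ULL);  // low-startup path
}

static uint64_t LongHashSketch(const void* key, size_t len) {
    return PlaceholderHash(key, len, 0x9e3779b97f4a7c15ULL);  // trickle-feed path
}

// Callers see one entry point; short keys take the cheap path automatically.
static uint64_t HashSketch(const void* key, size_t len) {
    return (len < 192) ? ShortHashSketch(key, len) : LongHashSketch(key, len);
}
```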
My frog test is the most stringent test I have for 64-bit and 128-bit noncryptographic hashes. It hashes n-byte keys that are all zero except for m bits set, for all possible choices of m bits, and looks for collisions. Once it maxes out memory, it keeps as many hashes as it can and continues churning through keys. Ideally you'd test a 128-bit hash by checking 2^128 key-pairs, which should produce about 1 collision if the function is adequately random. That's way beyond my means. 2^72 key-pairs is about the limit of my testing abilities.
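In miniature, the frog test's enumeration looks like this sketch: all 8-byte keys that are zero except for exactly two set bits (n = 8, m = 2), hashed with a stand-in bijective mixer rather than SpookyHash, checked for collisions:

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_set>

// Stand-in mixer (MurmurHash3's 64-bit finalizer) -- not SpookyHash.
static uint64_t ToyMix(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// Hash every 8-byte key with exactly two bits set (64 choose 2 = 2016 keys)
// and count hash collisions among them.
static size_t TwoBitCollisions() {
    std::unordered_set<uint64_t> seen;
    size_t collisions = 0;
    for (int i = 0; i < 64; i++) {
        for (int j = i + 1; j < 64; j++) {
            uint64_t key = (1ULL << i) | (1ULL << j);
            if (!seen.insert(ToyMix(key)).second) collisions++;
        }
    }
    return collisions;
}
```

The real frog test runs this idea at scale: larger n and m, 128-bit hash values, and it keeps churning through keys even after the table of seen hashes no longer fits in memory.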
I have not tried the CRC32 instruction introduced with Intel's Nehalem chips, because I don't have one. Google has, with CityHash. They claim 6 bytes per cycle, which is faster than any hash I've seen. On my machine CityHash is about half the speed of SpookyHash, but my machine doesn't have a CRC32 instruction. CityHash passes my frog test for at least 2^72 key-pairs.
The number of key-pairs covered scales linearly with the speed, number of CPUs, and time dedicated to the task, and quadratically with the memory and disk that can be thrown at it. My current implementation of the frog test can use all of a machine's cores, but it only uses RAM to hold the keys. I'm working on another version (named the hog test) that would use all the disk, RAM, and CPU you could throw at it. If you could produce 16 petabytes of 16-byte hashes, sort them, and detect collisions, that would cover 2^100 key-pairs. Sorting 16 terabytes of hashes would cover 2^80 key-pairs. This would look remarkably like the terasort benchmark, except it would actually be accomplishing something useful.